
Step 3/3 Feat/temporal filters in rag pipeline #284

Open
Ahmath-Gadji wants to merge 3 commits into dev from feat/temporal_filters_in_rag_pipeline

Conversation


@Ahmath-Gadji Ahmath-Gadji commented Mar 13, 2026

feat: temporal created_at filters in the RAG query pipeline

Summary

Adds LLM-driven temporal filters to retrieval so users can naturally ask time-bounded questions ("reports from last month", "meeting notes this week") and have them translated into Milvus created_at filters before vector search.

Changes

  • New structured query models in openrag/components/pipeline.py: TemporalPredicate (field/op/value on created_at) and Query (text + AND-combined predicates). SearchQueries now returns a list of Query objects instead of raw strings.
  • RetrieverPipeline.retrieve_docs takes a Query and derives the Milvus filter via Query.to_milvus_filter() — no schema change on the collection.
  • Filterless fallback: when a temporally-filtered search returns zero docs, retry without the filter. Gated by retriever.allow_filterless_fallback (env RETRIEVER_ALLOW_FILTERLESS_FALLBACK, default true) so strict deployments can opt out.
  • Schema-validation retry: structured output now uses with_structured_output(..., strict=True); query generation retries once on ValidationError / OutputParserException, then falls back to the raw user query.
  • Prompt template (prompts/example1/query_contextualizer_tmpl.txt, v1) adds query-decomposition rules, temporal-filter rules with explicit lower/upper-bound semantics, and richer {current_date} (weekday + time) to disambiguate relative expressions. Key distinction: ingestion date → filter; document topic date → no filter.

Benchmarks

Offline harness under benchmarks/prompt_eval/ comparing prompt v0 (pre-change) vs v1 (this PR) on the two production-candidate models. Judge: Qwen3-VL-8B-Instruct-FP8. Full results in benchmarks/prompt_eval/Report_temporal_filter.md and benchmarks/prompt_eval/Report_query_decomposition.md.

Temporal filter detection (40 cases, 20 pos / 20 neg):

| Prompt | Model | Accuracy | F1 | Filter correctness (TP) |
|---|---|---|---|---|
| v0 | Mistral-Small-3.1-24B | 75.0% | 66.7% | 8/10 |
| v1 | Mistral-Small-3.1-24B | 100.0% | 100.0% | 18/20 (90.0%) |
| v1 | Qwen3-VL-8B | 94.9% | 95.0% | 17/19 (89.5%) |

Query decomposition (80 cases across D1/D2/D3):

| Prompt | Model | count_match | semantic_coverage |
|---|---|---|---|
| v0 | Mistral-Small-3.1-24B | 82.5% | 92.5% |
| v1 | Mistral-Small-3.1-24B | 86.2% | 95.0% |
| v1 | Qwen3-VL-8B | 82.5% | 88.8% |

Recommended pairing: v1 + Mistral-Small-3.1-24B — 100% filter detection, 90% filter-body correctness, and best decomposition numbers. Residual failures are narrow (e.g. spurious upper bound on "since last Monday", over-wide window on "latest") and documented in the reports.
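The accuracy and F1 figures above follow the standard confusion-matrix definitions over the labeled positive/negative cases; the actual harness uses an LLM judge, but the metric arithmetic alone can be sketched as follows (the `(expects_filter, produced_filter)` case format is an assumption for illustration):

```python
def score_filter_detection(cases: list[tuple[bool, bool]]) -> tuple[float, float]:
    """cases: (expects_filter, produced_filter) pairs; returns (accuracy, f1)."""
    tp = sum(1 for exp, got in cases if exp and got)
    fp = sum(1 for exp, got in cases if not exp and got)
    fn = sum(1 for exp, got in cases if exp and not got)
    tn = sum(1 for exp, got in cases if not exp and not got)
    accuracy = (tp + tn) / len(cases)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1
```

Filter-body correctness (the TP column) is then judged separately on only the true-positive cases, which is why it is reported as a fraction like 18/20.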

Backwards compatibility

No collection/schema change. Queries with no temporal reference produce temporal_filters=None and hit the same retrieval path as before.
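The gated filterless fallback described above might look roughly like this sketch; apart from the documented RETRIEVER_ALLOW_FILTERLESS_FALLBACK variable, the function and attribute names here are illustrative, not the actual pipeline code:

```python
import os

def allow_filterless_fallback() -> bool:
    # Mirrors retriever.allow_filterless_fallback (env override, default true).
    return os.getenv("RETRIEVER_ALLOW_FILTERLESS_FALLBACK", "true").lower() == "true"

async def retrieve_docs(retriever, query):
    # Derive the Milvus filter from the Query's temporal predicates.
    milvus_filter = query.to_milvus_filter()
    docs = await retriever.search(query.text, filter=milvus_filter)
    if not docs and milvus_filter and allow_filterless_fallback():
        # Filtered search matched nothing: retry once without the filter so the
        # user still gets an answer; strict deployments can disable this.
        docs = await retriever.search(query.text, filter=None)
    return docs
```

A query with no temporal reference produces a `None` filter, so it never triggers the fallback and behaves exactly as before.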

Summary by CodeRabbit

  • New Features

    • Date-based (temporal) filter support for generated search queries.
    • Retrieval fallback that retries without temporal filters when filtered results are empty.
    • Prompt template enforcing structured (JSON) query outputs with explicit temporal predicates.
  • Documentation

    • Added benchmark reports evaluating query decomposition and temporal-filter generation, with per-tier and per-prompt analyses.
  • Chores

    • New benchmark datasets, evaluation tooling, and project manifest to support prompt evaluation.


coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

Adds a prompt-eval benchmarking suite (datasets, reports, prompts, CLI evaluators), introduces structured query models with temporal predicates (TemporalPredicate, Query, SearchQueries), updates retrieval to accept Query objects and optionally retry without temporal filters, and wires configuration, tests, and project manifests accordingly.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Benchmark project manifest & env**<br>`benchmarks/prompt_eval/.env.example`, `benchmarks/prompt_eval/pyproject.toml` | New environment example and pyproject manifest for the benchmark project. |
| **Datasets**<br>`benchmarks/prompt_eval/datasets/query_decomposition.json`, `benchmarks/prompt_eval/datasets/temporal_filter.json` | Added 80-case query decomposition dataset and 40-case temporal-filter dataset (labeled positive/negative). |
| **Benchmark reports**<br>`benchmarks/prompt_eval/Report_query_decomposition.md`, `benchmarks/prompt_eval/Report_temporal_filter.md` | New benchmark reports describing datasets, metrics, judge setup, results, and qualitative analyses. |
| **Evaluation CLIs**<br>`benchmarks/prompt_eval/eval_query_decomposition.py`, `benchmarks/prompt_eval/eval_temporal_filter_generation.py` | New async CLI evaluators: env-driven multi-model orchestration, structured-output parsing, LLM-as-judge calls, per-case metrics, concurrency, and optional JSON result export. |
| **Prompt templates**<br>`benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt`, `benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt`, `prompts/example1/...` | Added v0/v1 contextualizer prompts (v1 enforces JSON query_list with temporal_filters); updated example/system/spoken templates to include {current_date} and enforce the JSON contract. |
| **Pipeline: query schema & retrieval**<br>`openrag/components/pipeline.py` | Added Pydantic TemporalPredicate and Query; switched SearchQueries.query_list to list[Query]; updated retriever signatures to accept Query, use Query.to_milvus_filter(), added gated fallback retry without filter, and hardened structured-output parsing with retries/fallbacks. |
| **Config surface**<br>`conf/config.yaml`, `openrag/config/loader.py`, `openrag/config/models.py`, `pyproject.toml` | Added retriever.allow_filterless_fallback (default true) with env var mapping; updated workspace config to exclude benchmarks/prompt_eval. |
| **Tests / mocks**<br>`tests/api_tests/api_run/mock_vllm.py` | Improved mock tool-call argument generation: added _resolve_ref() and _mock_value() to recursively construct nested/array/union mock values from JSON Schema. |
| **Other prompts**<br>`prompts/example1/query_contextualizer_tmpl.txt`, `prompts/example1/spoken_style_answer_tmpl.txt`, `prompts/example1/sys_prompt_tmpl.txt` | Converted prompt output to strict JSON schema with temporal-filter rules; added Current date: {current_date} to context sections. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ChatRAG as ChatRAG Pipeline
    participant LLM as Query Generator
    participant Retriever as RetrieverPipeline
    participant Milvus as Milvus DB
    participant Judge as LLM Judge

    User->>ChatRAG: sends last message
    ChatRAG->>LLM: format prompt (current_date, query_language)
    LLM-->>ChatRAG: return SearchQueries (query_list of Query objects)
    ChatRAG->>Retriever: retrieve_docs(Query)
    Retriever->>Retriever: Query.to_milvus_filter() -> filter string
    Retriever->>Milvus: vector search with filter
    Milvus-->>Retriever: results
    alt filtered results empty & allow_filterless_fallback
        Retriever->>Milvus: retry vector search without filter
        Milvus-->>Retriever: results
    end
    Retriever-->>ChatRAG: documents
    ChatRAG->>User: final response with sources

    Note over Judge: (async) Judge evaluates semantic coverage or filter correctness
    ChatRAG->>Judge: send case + generated output
    Judge-->>ChatRAG: verdict (covered/correct + reasoning)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • paultranvan

Poem

🐰 I hopped through prompts and datasets bright,
I carved time-filters in the moonlight,
Queries split and judges chew,
Fallbacks fetch when filters rue,
Benchmarks bloom — a rabbit's delight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.36%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title directly matches the PR's main objective: adding temporal filters to the RAG pipeline. It accurately captures the feature being implemented across the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@Ahmath-Gadji Ahmath-Gadji changed the title Feat/temporal filters in rag pipeline 3/3 Feat/temporal filters in rag pipeline Mar 17, 2026
@Ahmath-Gadji Ahmath-Gadji changed the title 3/3 Feat/temporal filters in rag pipeline Step 3/3 Feat/temporal filters in rag pipeline Mar 17, 2026
@Ahmath-Gadji Ahmath-Gadji force-pushed the feat/temporal_filters_in_rag_pipeline branch 10 times, most recently from 22fc7d7 to e043548 Compare April 14, 2026 14:55
@Ahmath-Gadji Ahmath-Gadji marked this pull request as ready for review April 14, 2026 15:06
@coderabbitai coderabbitai bot added the feat Add a new feature label Apr 19, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
prompts/example1/spoken_style_answer_tmpl.txt (1)

22-22: ⚠️ Potential issue | 🟡 Minor

Fix typo in user-facing prompt text.

"humain" should be "human".

📝 Proposed fix
-   * Keep your answer succinct and straight to the point as if speaking to humain.
+   * Keep your answer succinct and straight to the point as if speaking to human.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@prompts/example1/spoken_style_answer_tmpl.txt` at line 22, Fix the typo in
the user-facing prompt template: in the prompt string that begins "Keep your
answer succinct and straight to the point as if speaking to humain." (in
spoken_style_answer_tmpl.txt / the "Keep your answer..." prompt), change
"humain" to "human" so the sentence reads "...speaking to human."
openrag/components/pipeline.py (1)

294-301: ⚠️ Potential issue | 🔴 Critical

Fix web search calls to pass query string instead of Query object.

At lines 297 and 320, web_search_service.search() receives a Query Pydantic model instead of the expected str. This silently calls Query.__str__(), which returns f"Query: {self.query}, Filter: {self.to_milvus_filter()}", polluting the search engine with literal tokens ("Query:", "Filter:") and Milvus filter expressions, significantly degrading result quality.

Fix
-            web_tasks = [self.web_search_service.search(q) for q in queries.query_list]
+            web_tasks = [self.web_search_service.search(q.query) for q in queries.query_list]

Apply the same fix at line 320 (web-only branch).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openrag/components/pipeline.py` around lines 294 - 301, The web search is
being called with a Query Pydantic model instead of the raw string, which causes
Query.__str__() to pollute the search; update the web search invocations to pass
the actual query string (use q.query) rather than the Query object: change the
web_tasks comprehension that builds from queries.query_list to call
self.web_search_service.search(q.query), and make the identical change in the
web-only branch where web_tasks is constructed (the second occurrence near the
other web-only logic).
♻️ Duplicate comments (1)
utility/data_indexer2.py (1)

1-121: 🛠️ Refactor suggestion | 🟠 Major

This script appears to duplicate data_indexer.py and was reportedly not intended for upstream.

Per prior discussion on this PR, you mentioned this file was only for local use and not meant to be pushed. It also largely duplicates utility/data_indexer.py with added PDF-only filtering, a 100-file cap, page-count bounds (11–40), and size/page statistics. If it is intentionally kept, please:

  • Rename it to reflect its purpose (e.g., pdf_benchmark_indexer.py) rather than data_indexer2.
  • Fold the shared logic (arg parsing, __check_api, upload loop) into a common helper to avoid drift between the two scripts.
  • Document the rationale for the hardcoded limit = 100 and page bounds (10, 40] — these look benchmark-specific.

Otherwise, consider dropping it from this PR since it is orthogonal to the temporal-filters feature.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@utility/data_indexer2.py` around lines 1 - 121, This file duplicates existing
data_indexer.py and shouldn't be upstream as-is; either remove it from the PR or
rename it to reflect purpose (e.g., pdf_benchmark_indexer.py) and refactor
common logic (arg parsing, __check_api, upload loop) into a shared helper module
to prevent drift; if you keep it, add a short docstring near the top declaring
its benchmark intent and justify the hard-coded limit = 100 and the page bounds
check (the page_count <= 10 or page_count > 40 filter) so reviewers know they
are intentional.
🧹 Nitpick comments (5)
utility/data_indexer.py (1)

36-46: __check_api does not fail on non-200 responses.

If the health check returns e.g. 500 or 404, the function silently returns and the script proceeds to index. Only httpx.RequestError (connection-level) is raised. Consider response.raise_for_status() or an explicit non-200 branch so a misconfigured URL/auth fails fast.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@utility/data_indexer.py` around lines 36 - 46, The __check_api function
currently only raises on httpx.RequestError and silently continues on non-200
responses; modify __check_api to detect non-200 HTTP responses (e.g., call
response.raise_for_status() or explicitly check response.status_code != 200) and
raise an exception (or log and raise) so the caller fails fast; update the
try/except around httpx.get in __check_api to include HTTP status errors (using
response.raise_for_status() is preferred) so any 4xx/5xx result causes the
function to raise instead of returning silently.
benchmarks/prompt_eval/Report_query_decomposition.md (1)

50-63: Optional: Consider varying sentence structure for readability.

The static analysis tool flags repetitive sentence openings in the "Common hard cases" section. While not critical, varying the structure could improve readability.

✍️ Optional rewording suggestion
 ### Common hard cases (both models, both prompts)
 
-- **id 30** — "Compare AWS Lambda and Google Cloud Functions" stays as a single comparison query.
-- **id 63** — "Kafka vs RabbitMQ": earlier gRPC turn in the history contaminates the reformulation.
-- **id 70** — "air freight vs sea freight": emitted as a single comparison instead of two independent lookups.
-- **id 74** — "EASA vs FAA certification": same pattern.
+- **id 30** — "Compare AWS Lambda and Google Cloud Functions" stays as a single comparison query.
+- **id 63** — "Kafka vs RabbitMQ": earlier gRPC turn in the history contaminates the reformulation.
+- For **id 70** ("air freight vs sea freight"), the model emits a single comparison instead of two independent lookups.
+- **id 74** ("EASA vs FAA certification") exhibits the same pattern.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/prompt_eval/Report_query_decomposition.md` around lines 50 - 63,
The "Common hard cases (both models, both prompts)" bullet list uses repetitive
sentence openings and punctuation; please rewrite that section (the header and
the four bullets referencing id 30, id 63, id 70, id 74) to vary sentence
structure and lead-ins for readability — e.g., turn some bullets into full
sentences starting with the case description ("Compare AWS Lambda and Google
Cloud Functions") and others starting with the observed failure ("Earlier gRPC
turn..."), replace repeated em-dash openings with varied connectors or clauses,
and ensure each item still mentions the id and the failure mode (single
comparison, contaminated reformulation, under-splitting) without changing
meaning.
benchmarks/prompt_eval/eval_temporal_filter_generation.py (1)

55-230: Extract shared infra with eval_query_decomposition.py to prevent drift.

_parse_env_list, _build_models, _judge_config, _model_kwargs, build_llm_messages, format_prompt, the TemporalPredicate/Query/SearchQueries schemas, and the CLI path-sandboxing logic are duplicated almost verbatim between the two evaluators. Any future change (e.g. adding strict=True, switching to UTC, changing the default max_completion_tokens) will need to be applied in both places, and already-present asymmetries will grow. Consider extracting a benchmarks/prompt_eval/_common.py (or lib/) module.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/prompt_eval/eval_temporal_filter_generation.py` around lines 55 -
230, Extract duplicated utilities and schemas into a shared module (e.g.,
benchmarks/prompt_eval/_common.py): move _parse_env_list, _build_models,
_judge_config, _model_kwargs, build_llm_messages, format_prompt and the Pydantic
models TemporalPredicate, Query, SearchQueries (and any CLI path-sandboxing
helpers) into that module; then import and use them from
eval_temporal_filter_generation.py (and eval_query_decomposition.py) so both
files reference the single implementation; ensure imports/exports preserve names
and behavior (including default values like max_completion_tokens and any
langdetect fallback), run lints/tests, and update any relative references to
MODELS or judge_config usage to use the shared module.
benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt (1)

31-42: Tighten the past N/bare MONTH rules — the formulas drop the time unit.

Line 37 writes [Current − N, tomorrow) without carrying over the unit (days/weeks/months/years), so a model parsing "past 2 weeks" could read the lower bound as Current − 2. Line 41 ("nearest past occurrence") is ambiguous when the bare month equals the current month or is in the future (e.g. "December" queried in April 2026 — Dec 2025 or Dec 2026?). Consider:

✏️ Proposed clarification
-- past N days/weeks/months/years → `[Current − N, tomorrow)`
+- past N days/weeks/months/years → `[Current − N <same unit>, tomorrow)` (e.g. "past 2 weeks" → `[Current − 14 days, tomorrow)`)
 - recent / latest → treat as "past 90 days"
 - since X → one predicate `>= X`
 - before X → one predicate `< X`
-- bare MONTH (no year) → nearest past occurrence
+- bare MONTH (no year) → the most recent fully-elapsed occurrence of that month (if the named month equals the current month, use the occurrence one year prior)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt` around lines
31 - 42, The "past N" and "bare MONTH" rules are ambiguous: change the "past N
days/weeks/months/years → `[Current − N, tomorrow)`" line to explicitly carry
the time unit (e.g., "past N days/weeks/months/years → `[Current − N
{days|weeks|months|years}, tomorrow)`") and ensure "weeks" and "months" are
interpreted as whole-week and whole-month arithmetic (not unitless subtraction).
Clarify the "bare MONTH (no year) → nearest past occurrence" rule by specifying
deterministic behavior: if MONTH is before the current month use MONTH of the
current year; if MONTH is the current month or after, use MONTH of the previous
year (i.e., the most recent calendar occurrence strictly before today). Ensure
these exact phrases (the "past N days/weeks/months/years" rule and the "bare
MONTH (no year)" rule) are updated so models include the unit and the
tie-breaker rule unambiguously.
openrag/components/pipeline.py (1)

73-77: The double-quote syntax is correct; single and double quotes are both valid in Milvus TIMESTAMPTZ filter expressions.

The codebase's use of double quotes in ISO "2026-03-15T00:00:00+00:00" is valid. Milvus's expression parser accepts both single and double quotes for string literals, including ISO timestamp literals, and treats them equivalently. This is consistent across the codebase (tests, docstrings, and the to_milvus_filter() method).

However, the broader design note stands: TemporalPredicate.operator permits ["==", "!=", ">", "<", ">=", "<="], but all tested and documented filter cases use only ordered comparisons (<, >, >=, <=). Equality and inequality checks on TIMESTAMPTZ fields are semantically unusual. If equality comparisons are not intentional, consider narrowing the Literal to exclude "==" and "!=".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openrag/components/pipeline.py` around lines 73 - 77, The TemporalPredicate
currently allows equality operators but we only use ordered comparisons for
TIMESTAMPTZ; update the type/definition for the TemporalPredicate.operator (the
Literal used where TemporalPredicate is declared) to remove "==" and "!=" so
only "<", ">", "<=", ">=" are permitted; keep the to_milvus_filter() behavior
(it can continue to emit ISO "..." strings) and ensure any type hints or
validations referencing TemporalPredicate.operator (e.g., construction sites or
validators) are updated to the narrower set.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmarks/prompt_eval/eval_query_decomposition.py`:
- Around line 1-32: The default dataset filename is misspelled as
"query_decompostion" in multiple places; update every occurrence to
"query_decomposition" — specifically fix the docstring references and the
argparse default for the --dataset option (the string
"datasets/query_decompostion.json" and any plain "query_decompostion"
occurrences) so that the script loads datasets/query_decomposition.json by
default and no longer raises FileNotFoundError when run without --dataset.

In `@openrag/components/pipeline.py`:
- Around line 112-122: The current silent fallback that re-runs
self.retriever.retrieve without filters when docs is empty should be made
explicit and configurable: add a pipeline config flag (e.g.,
allow_filterless_fallback) and only perform the unfiltered re-query when that
flag is True; when you do fallback, set a metadata flag (e.g.,
filter_dropped=True) on the returned result (or attach to the docs/container) so
callers can surface that the temporal filter was dropped; replace the
logger.debug call with logger.warning or logger.info so it’s visible at INFO
level and include context (query=query.query, partition, reason); keep the
original retriever call parameters (partition, query.query, filter_params) when
not falling back and ensure all code paths return the same result shape
including the new metadata key.

In `@utility/data_indexer.py`:
- Around line 57-60: The deterministic file_id generation (absolute_path ->
file_id) now causes duplicate POSTs to url =
f"{args.url}/indexer/partition/{args.partition}/file/{file_id}" to return HTTP
409 and break reruns; update the uploader call in utility/data_indexer.py (the
block that posts to url) to treat 409 as non-fatal: catch the HTTP response or
exception from the POST, and if response.status_code == 409 (or the raised
HTTPError maps to 409), log an informative message and continue (skip that file)
instead of raising; otherwise preserve existing error handling for other status
codes so real failures still surface.

In `@utility/data_indexer2.py`:
- Around line 63-68: The try/except around fitz.open(file_path) currently
swallows all exceptions; change it to except Exception as e and log the failure
before falling back to page_count = 0 (e.g. logger.warning or logger.exception)
so operators can see why a PDF was skipped; update the block that calls
fitz.open(file_path) and sets page_count to capture the exception object (e) and
include file_path and e in the log message.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 395f8e6d-4bcc-412e-9833-038603131849

📥 Commits

Reviewing files that changed from the base of the PR and between 922714e and 7d30ac3.

⛔ Files ignored due to path filters (1)
  • benchmarks/prompt_eval/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • benchmarks/prompt_eval/.env.example
  • benchmarks/prompt_eval/Report_query_decomposition.md
  • benchmarks/prompt_eval/Report_temporal_filter.md
  • benchmarks/prompt_eval/datasets/query_decomposition.json
  • benchmarks/prompt_eval/datasets/temporal_filter.json
  • benchmarks/prompt_eval/eval_query_decomposition.py
  • benchmarks/prompt_eval/eval_temporal_filter_generation.py
  • benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt
  • benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt
  • benchmarks/prompt_eval/pyproject.toml
  • benchmarks/prompt_eval/results/result_filter_generation.json
  • benchmarks/prompt_eval/results/result_query_decomposition.json
  • openrag/components/pipeline.py
  • prompts/example1/query_contextualizer_tmpl.txt
  • prompts/example1/spoken_style_answer_tmpl.txt
  • prompts/example1/sys_prompt_tmpl.txt
  • pyproject.toml
  • tests/api_tests/api_run/mock_vllm.py
  • utility/data_indexer.py
  • utility/data_indexer2.py

Comment thread benchmarks/prompt_eval/eval_query_decomposition.py
Comment thread openrag/components/pipeline.py Outdated
Comment thread utility/data_indexer.py Outdated
Comment thread utility/data_indexer2.py Outdated
@EnjoyBacon7 EnjoyBacon7 force-pushed the feat/temporal_filters_in_rag_pipeline branch from 7d30ac3 to 40ed25d Compare April 19, 2026 18:00
Base automatically changed from dev to main April 20, 2026 15:26
@EnjoyBacon7 EnjoyBacon7 changed the base branch from main to dev April 21, 2026 07:15
@EnjoyBacon7 EnjoyBacon7 force-pushed the feat/temporal_filters_in_rag_pipeline branch from 40ed25d to 240f1c1 Compare April 21, 2026 07:29
@coderabbitai coderabbitai bot added the breaking-change Change of behavior after upgrade label Apr 21, 2026
@Ahmath-Gadji Ahmath-Gadji removed the breaking-change Change of behavior after upgrade label Apr 21, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
benchmarks/prompt_eval/eval_query_decomposition.py (1)

335-355: Bound concurrent LLM calls to avoid rate-limit-skewed results.

asyncio.as_completed(tasks) schedules every case immediately. With 80 cases, each potentially doing a generator call and a judge call, transient rate limits can dominate the report instead of prompt quality. Consider wrapping run_case() with an asyncio.Semaphore and exposing a --concurrency option.

Minimal bounded-concurrency shape
+    semaphore = asyncio.Semaphore(8)
+
+    async def run_limited_case(case: dict) -> CaseResult:
+        async with semaphore:
+            return await run_case(
+                case,
+                prompt_template,
+                query_generator,
+                model_base_url,
+                coverage_judge,
+                judge_base_url,
+            )
+
     tasks = [
-        run_case(
-            case,
-            prompt_template,
-            query_generator,
-            model_base_url,
-            coverage_judge,
-            judge_base_url,
-        )
+        run_limited_case(case)
         for case in dataset
     ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/prompt_eval/eval_query_decomposition.py` around lines 335 - 355,
The loop currently schedules every run_case concurrently via
asyncio.as_completed which can flood LLM endpoints; introduce a concurrency
limiter by creating an asyncio.Semaphore (configurable via a new --concurrency
CLI option) and wrap each run_case invocation in an async wrapper that acquires
the semaphore before calling run_case(case, ...) and releases it after, then
schedule those wrapper coroutines instead of raw run_case in tasks; keep the
tqdm/as_completed consumption the same but ensure tasks are created from the
semaphore-wrapped coroutine to bound simultaneous generator/judge calls.
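As a runnable illustration of the bounded-concurrency shape, with a stand-in `run_case` rather than the benchmark's real generator/judge calls:

```python
import asyncio

async def run_case(case: int) -> int:
    await asyncio.sleep(0)  # stand-in for the generator + judge LLM calls
    return case * 2

async def run_all(n_cases: int, concurrency: int = 8) -> list[int]:
    semaphore = asyncio.Semaphore(concurrency)

    async def run_limited_case(case: int) -> int:
        async with semaphore:  # at most `concurrency` cases in flight at once
            return await run_case(case)

    tasks = [asyncio.ensure_future(run_limited_case(c)) for c in range(n_cases)]
    results = []
    for fut in asyncio.as_completed(tasks):  # same consumption pattern as the script
        results.append(await fut)
    return results

results = asyncio.run(run_all(80))
```

The `tqdm`/`as_completed` loop stays unchanged; only the scheduled coroutines are wrapped.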
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmarks/prompt_eval/.env.example`:
- Around line 2-4: The env var names BASE_URLs, API_KEYs, and MODELs in
.env.example are mixed-case and trigger dotenv-linter; either normalize them to
uppercase across all usages or add a linter exception: Option A — rename the
three variables to BASE_URLS, API_KEYS, and MODELS and update every reference in
eval_query_decomposition.py (references to BASE_URLs/API_KEYs/MODELs) and
eval_temporal_filter_generation.py (same three references) so they match; Option
B — add a .dotenv-linter.json rule to ignore these specific keys (BASE_URLs,
API_KEYs, MODELs) to suppress the warning if you want to keep the current names.

In `@benchmarks/prompt_eval/eval_query_decomposition.py`:
- Around line 192-204: format_prompt currently uses datetime.now(), making
outputs non-deterministic; change format_prompt to accept a reference_date (or
read a module-level DATASET_REFERENCE_DATE set to 2026-04-17) and use that date
to produce current_date (e.g., reference_date.strftime(...)) instead of
datetime.now(); update the other place that formats current_date (the similar
block at lines ~255-258) to also use the same reference_date or constant so all
relative-date prompt injections are deterministic and reproducible.

In `@openrag/config/models.py`:
- Line 422: Change the default for temporal retrieval to strict by setting the
Allow Filterless Fallback flag to False: update the attribute
allow_filterless_fallback in the ModelsConfig (symbol:
allow_filterless_fallback) from True to False and mirror the same default in
conf/config.yaml so both runtime model config and the YAML default are
consistent; ensure any documentation or initialization logic that reads
allow_filterless_fallback continues to work with the new False default.

In `@prompts/example1/query_contextualizer_tmpl.txt`:
- Around line 26-29: The example that maps content dates to a created_at filter
is incorrect: only emit temporal_filters (with field="created_at") when the user
explicitly refers to document creation/publication/written times; for user
requests about events/meetings held in a time window leave temporal_filters null
(or emit a separate event-time predicate if your schema supports it). Update the
example that rewrites “meetings held in the past month” so it does not produce
created_at predicates, and apply the same correction to the other similar
example that currently emits created_at; keep the half-open interval rule and
ISO8601 format for true created_at predicates.

---

Nitpick comments:
In `@benchmarks/prompt_eval/eval_query_decomposition.py`:
- Around line 335-355: The loop currently schedules every run_case concurrently
via asyncio.as_completed which can flood LLM endpoints; introduce a concurrency
limiter by creating an asyncio.Semaphore (configurable via a new --concurrency
CLI option) and wrap each run_case invocation in an async wrapper that acquires
the semaphore before calling run_case(case, ...) and releases it after, then
schedule those wrapper coroutines instead of raw run_case in tasks; keep the
tqdm/as_completed consumption the same but ensure tasks are created from the
semaphore-wrapped coroutine to bound simultaneous generator/judge calls.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7464efc4-a5bd-4fad-871a-dd7b5cdb4605

📥 Commits

Reviewing files that changed from the base of the PR and between 7d30ac3 and 240f1c1.

⛔ Files ignored due to path filters (1)
  • benchmarks/prompt_eval/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (21)
  • benchmarks/prompt_eval/.env.example
  • benchmarks/prompt_eval/Report_query_decomposition.md
  • benchmarks/prompt_eval/Report_temporal_filter.md
  • benchmarks/prompt_eval/datasets/query_decomposition.json
  • benchmarks/prompt_eval/datasets/temporal_filter.json
  • benchmarks/prompt_eval/eval_query_decomposition.py
  • benchmarks/prompt_eval/eval_temporal_filter_generation.py
  • benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt
  • benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt
  • benchmarks/prompt_eval/pyproject.toml
  • benchmarks/prompt_eval/results/result_filter_generation.json
  • benchmarks/prompt_eval/results/result_query_decomposition.json
  • conf/config.yaml
  • openrag/components/pipeline.py
  • openrag/config/loader.py
  • openrag/config/models.py
  • prompts/example1/query_contextualizer_tmpl.txt
  • prompts/example1/spoken_style_answer_tmpl.txt
  • prompts/example1/sys_prompt_tmpl.txt
  • pyproject.toml
  • tests/api_tests/api_run/mock_vllm.py
✅ Files skipped from review due to trivial changes (9)
  • prompts/example1/spoken_style_answer_tmpl.txt
  • prompts/example1/sys_prompt_tmpl.txt
  • pyproject.toml
  • benchmarks/prompt_eval/pyproject.toml
  • benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt
  • benchmarks/prompt_eval/datasets/temporal_filter.json
  • benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt
  • benchmarks/prompt_eval/Report_temporal_filter.md
  • benchmarks/prompt_eval/eval_temporal_filter_generation.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/api_tests/api_run/mock_vllm.py
  • benchmarks/prompt_eval/datasets/query_decomposition.json
  • openrag/components/pipeline.py

Comment thread benchmarks/prompt_eval/.env.example Outdated
Comment thread benchmarks/prompt_eval/eval_query_decomposition.py
Comment thread openrag/config/models.py
Comment on lines +26 to +29
# Temporal filters
Emit `temporal_filters` when the user refers to WHEN a document was created / published / written. Leave null if the date only describes the topic.
Each predicate: `field="created_at"`, `operator` ∈ {{`>=`, `>`, `<=`, `<`}}, `value` = ISO 8601 UTC.
Predicates are AND-combined. Always use half-open intervals: `>= start` AND `< next_start`.


⚠️ Potential issue | 🟠 Major

Fix the example that converts a content date into a created_at filter.

Line 27 says filters are only for document creation/publication/writing time, but the example rewrites the request as meetings held in the past month and still emits created_at predicates. That will bias the model to over-filter content-date queries.

Prompt-safe fix
-User: "Summary of meetings in the past month"
-{{"query_list":[{{"query":"Summary of meetings held in the past month","temporal_filters":[{{"field":"created_at","operator":">=","value":"2026-03-15T00:00:00+00:00"}},{{"field":"created_at","operator":"<","value":"2026-04-16T00:00:00+00:00"}}]}}]}}
+User: "Summary of meeting notes uploaded in the past month"
+{{"query_list":[{{"query":"Summary of meeting notes uploaded in the past month","temporal_filters":[{{"field":"created_at","operator":">=","value":"2026-03-15T00:00:00+00:00"}},{{"field":"created_at","operator":"<","value":"2026-04-16T00:00:00+00:00"}}]}}]}}

Also applies to: 48-49

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@prompts/example1/query_contextualizer_tmpl.txt` around lines 26 - 29, The
example that maps content dates to a created_at filter is incorrect: only emit
temporal_filters (with field="created_at") when the user explicitly refers to
document creation/publication/written times; for user requests about
events/meetings held in a time window leave temporal_filters null (or emit a
separate event-time predicate if your schema supports it). Update the example
that rewrites “meetings held in the past month” so it does not produce
created_at predicates, and apply the same correction to the other similar
example that currently emits created_at; keep the half-open interval rule and
ISO8601 format for true created_at predicates.
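For the half-open interval rule itself, one deterministic derivation of a "past month" window — a sketch assuming "past month" means the same calendar day last month at 00:00 UTC up to, but not including, tomorrow at 00:00 UTC, matching the template's example:

```python
from datetime import datetime, timedelta, timezone

def past_month_predicates(now: datetime) -> list[dict]:
    """Half-open window: >= same day last month 00:00 UTC, < tomorrow 00:00 UTC."""
    today = now.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
    upper = today + timedelta(days=1)  # exclusive upper bound still covers all of today
    year = today.year - 1 if today.month == 1 else today.year
    month = 12 if today.month == 1 else today.month - 1
    try:
        lower = today.replace(year=year, month=month)
    except ValueError:  # e.g. May 31 -> "April 31": fall back to a 30-day window
        lower = today - timedelta(days=30)
    return [
        {"field": "created_at", "operator": ">=", "value": lower.isoformat()},
        {"field": "created_at", "operator": "<", "value": upper.isoformat()},
    ]
```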

@coderabbitai coderabbitai bot added the breaking-change Change of behavior after upgrade label Apr 21, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openrag/components/pipeline.py (1)

303-342: ⚠️ Potential issue | 🔴 Critical

Pass q.query (string) to web search instead of Query object.

At lines 312 and 335, queries.query_list contains Query pydantic objects, but self.web_search_service.search() expects a string parameter. Passing the Query object will stringify it to "Query: ..., Filter: ..." via __str__(), which is not a valid search term and will break web search.

🔧 Suggested fix
-            web_tasks = [self.web_search_service.search(q) for q in queries.query_list]
+            web_tasks = [self.web_search_service.search(q.query) for q in queries.query_list]
-            raw_web_lists = await asyncio.gather(*[self.web_search_service.search(q) for q in queries.query_list])
+            raw_web_lists = await asyncio.gather(
+                *[self.web_search_service.search(q.query) for q in queries.query_list]
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openrag/components/pipeline.py` around lines 303 - 342, The web search is
being called with Query objects from queries.query_list causing invalid search
terms; update both places where self.web_search_service.search is invoked (the
web_tasks comprehension and the inline search in web-only mode) to pass the raw
string field (q.query) instead of the Query object q so search receives a plain
string; ensure any variable names (e.g., web_tasks, raw_web_lists) remain
compatible with this change.
🧹 Nitpick comments (1)
openrag/components/pipeline.py (1)

115-128: Filterless fallback now gated + warned — addresses prior feedback.

Config flag (allow_filterless_fallback) plus logger.warning with query/filter/partition context correctly resolves the earlier concern about silent temporal-filter dropping. Consider also surfacing a filter_dropped signal on the response metadata so the UI can warn the user that results no longer respect the requested date range; not a blocker.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openrag/components/pipeline.py` around lines 115 - 128, When the code falls
back to filterless retrieval (the if block checking not docs and milvus_filter
and self.allow_filterless_fallback), add a machine-readable flag (e.g.,
filter_dropped: True) into the response metadata so callers/UIs can detect that
results no longer respect the original temporal filter; set this flag on
whatever response container you return from this pipeline (or on each
Document/DocsMetadata) immediately after docs = await
self.retriever.retrieve(...), using the same context (query.query, partition,
milvus_filter) so downstream code can surface a warning. Ensure the flag is only
present when this fallback path executes and preserved in the pipeline’s output
metadata structure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openrag/components/pipeline.py`:
- Around line 74-78: The Milvus filter builder in to_milvus_filter currently
emits expressions like ISO "..." which Milvus boolean grammar doesn't accept;
update the format in the list comprehension inside to_milvus_filter (where parts
is built from self.temporal_filters) to drop the "ISO " prefix and emit just the
quoted timestamp (e.g., f'{p.field} {p.operator} "{p.value}"'), then join parts
as before and return the resulting string.

---

Outside diff comments:
In `@openrag/components/pipeline.py`:
- Around line 303-342: The web search is being called with Query objects from
queries.query_list causing invalid search terms; update both places where
self.web_search_service.search is invoked (the web_tasks comprehension and the
inline search in web-only mode) to pass the raw string field (q.query) instead
of the Query object q so search receives a plain string; ensure any variable
names (e.g., web_tasks, raw_web_lists) remain compatible with this change.

---

Nitpick comments:
In `@openrag/components/pipeline.py`:
- Around line 115-128: When the code falls back to filterless retrieval (the if
block checking not docs and milvus_filter and self.allow_filterless_fallback),
add a machine-readable flag (e.g., filter_dropped: True) into the response
metadata so callers/UIs can detect that results no longer respect the original
temporal filter; set this flag on whatever response container you return from
this pipeline (or on each Document/DocsMetadata) immediately after docs = await
self.retriever.retrieve(...), using the same context (query.query, partition,
milvus_filter) so downstream code can surface a warning. Ensure the flag is only
present when this fallback path executes and preserved in the pipeline’s output
metadata structure.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8e1f2b90-06ba-47ce-82cf-e919741b981a

📥 Commits

Reviewing files that changed from the base of the PR and between 240f1c1 and 4f47041.

📒 Files selected for processing (1)
  • openrag/components/pipeline.py

Comment thread openrag/components/pipeline.py
Comment thread benchmarks/prompt_eval/.env.example Outdated
Comment thread openrag/components/pipeline.py
Comment thread openrag/components/pipeline.py Outdated
Comment thread openrag/components/pipeline.py
@paultranvan paultranvan removed the breaking-change Change of behavior after upgrade label Apr 21, 2026
@EnjoyBacon7 EnjoyBacon7 force-pushed the feat/temporal_filters_in_rag_pipeline branch from 4f47041 to 25015db Compare April 21, 2026 16:39
@EnjoyBacon7 EnjoyBacon7 force-pushed the feat/temporal_filters_in_rag_pipeline branch from 25015db to 0485e59 Compare April 21, 2026 16:54

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
openrag/components/pipeline.py (1)

211-273: Add ValidationError and OutputParserException to exceptions_to_handle in the fallback chain.

The current setup only triggers the function_calling fallback on openai.BadRequestError. However, when json_schema output fails Pydantic validation (raising ValidationError) or parsing (raising OutputParserException), the outer retry loop catches these exceptions but attempts the same json_schema path again on the second try, never exercising the fallback. Adding these exceptions to exceptions_to_handle ensures the fallback actually gets a chance to succeed on schema validation failures:

Change
         self.query_generator = primary.with_fallbacks(
             [fallback],
-            exceptions_to_handle=(openai.BadRequestError,),
+            exceptions_to_handle=(openai.BadRequestError, ValidationError, OutputParserException),
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openrag/components/pipeline.py` around lines 211 - 273, The fallback chain
currently only handles openai.BadRequestError so schema/parse failures never
trigger the function_calling fallback; update the with_fallbacks call that
builds self.query_generator (primary, fallback, exceptions_to_handle) to also
include ValidationError and OutputParserException alongside
openai.BadRequestError so that schema validation/parsing errors will invoke the
fallback path (references: primary, fallback, self.query_generator,
with_fallbacks, exceptions_to_handle, ValidationError, OutputParserException,
openai.BadRequestError).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmarks/prompt_eval/eval_temporal_filter_generation.py`:
- Around line 128-139: The benchmark's TemporalPredicate model allows "==" and
"!=" but production's TemporalPredicate (openrag/components/pipeline.py) only
permits ">", "<", ">=", "<="; update the operator Literal in TemporalPredicate
in this file (class TemporalPredicate, field operator) to match production by
removing "==" and "!=" so it becomes Literal[">", "<", ">=", "<="], and make the
identical change to the copy in eval_query_decomposition.py so both benchmarks
mirror production validation and scoring.

In `@tests/api_tests/api_run/mock_vllm.py`:
- Around line 238-264: Handle enum/const and prevent infinite recursion: update
_mock_value to check for "const" and "enum" (return schema["const"] if present;
for "enum" return the first enum value that matches the expected type) before
the type branches so constrained strings (e.g., TemporalPredicate.op) are mocked
correctly; in _resolve_ref replace lstrip("#/") with removeprefix("#/") to
correctly strip JSON Pointer prefixes; add a simple cycle guard by threading a
visited set (e.g., visited_ids) through _resolve_ref and _mock_value to detect
and break recursive $ref loops (return None or a safe default when a cycle is
seen). Ensure you reference the existing function names _mock_value and
_resolve_ref when making changes.

---

Nitpick comments:
In `@openrag/components/pipeline.py`:
- Around line 211-273: The fallback chain currently only handles
openai.BadRequestError so schema/parse failures never trigger the
function_calling fallback; update the with_fallbacks call that builds
self.query_generator (primary, fallback, exceptions_to_handle) to also include
ValidationError and OutputParserException alongside openai.BadRequestError so
that schema validation/parsing errors will invoke the fallback path (references:
primary, fallback, self.query_generator, with_fallbacks, exceptions_to_handle,
ValidationError, OutputParserException, openai.BadRequestError).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ee251a91-f7f7-4536-8d70-5508ff06add8

📥 Commits

Reviewing files that changed from the base of the PR and between 4f47041 and 0485e59.

⛔ Files ignored due to path filters (1)
  • benchmarks/prompt_eval/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (21)
  • benchmarks/prompt_eval/.env.example
  • benchmarks/prompt_eval/Report_query_decomposition.md
  • benchmarks/prompt_eval/Report_temporal_filter.md
  • benchmarks/prompt_eval/datasets/query_decomposition.json
  • benchmarks/prompt_eval/datasets/temporal_filter.json
  • benchmarks/prompt_eval/eval_query_decomposition.py
  • benchmarks/prompt_eval/eval_temporal_filter_generation.py
  • benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt
  • benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt
  • benchmarks/prompt_eval/pyproject.toml
  • benchmarks/prompt_eval/results/result_filter_generation.json
  • benchmarks/prompt_eval/results/result_query_decomposition.json
  • conf/config.yaml
  • openrag/components/pipeline.py
  • openrag/config/loader.py
  • openrag/config/models.py
  • prompts/example1/query_contextualizer_tmpl.txt
  • prompts/example1/spoken_style_answer_tmpl.txt
  • prompts/example1/sys_prompt_tmpl.txt
  • pyproject.toml
  • tests/api_tests/api_run/mock_vllm.py
✅ Files skipped from review due to trivial changes (7)
  • prompts/example1/sys_prompt_tmpl.txt
  • prompts/example1/spoken_style_answer_tmpl.txt
  • pyproject.toml
  • benchmarks/prompt_eval/pyproject.toml
  • benchmarks/prompt_eval/datasets/temporal_filter.json
  • benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt
  • benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt
🚧 Files skipped from review as they are similar to previous changes (4)
  • openrag/config/models.py
  • conf/config.yaml
  • benchmarks/prompt_eval/datasets/query_decomposition.json
  • prompts/example1/query_contextualizer_tmpl.txt

Comment on lines +128 to +139
class TemporalPredicate(BaseModel):
    field: Literal["created_at"] = Field(default="created_at")
    operator: Literal["==", "!=", ">", "<", ">=", "<="]
    value: str = Field(description='ISO 8601 datetime with timezone, e.g. "2026-03-15T00:00:00+00:00".')


class Query(BaseModel):
    query: str = Field(description="A semantically enriched, descriptive query for vector similarity search.")
    temporal_filters: list[TemporalPredicate] | None = Field(
        default=None,
        description="Date predicates on created_at, AND-combined. Null when no creation-date restriction.",
    )


⚠️ Potential issue | 🟡 Minor

Operator set diverges from production TemporalPredicate — benchmark may score predicates production would reject.

Pipeline's TemporalPredicate (openrag/components/pipeline.py lines 52-54) restricts operator to Literal[">", "<", ">=", "<="]. This benchmark schema additionally permits == and !=. A candidate model that emits, e.g., operator: "==" will parse successfully here and potentially be judged "correct", but the same output would raise ValidationError in production and be dropped by the retry/fallback path. The reported filter_detection_recall and filter_correctness are therefore optimistic vs. real-world behavior.

Recommend aligning the benchmark schema with production so the evaluation reflects what the pipeline actually accepts:

♻️ Proposed alignment
 class TemporalPredicate(BaseModel):
     field: Literal["created_at"] = Field(default="created_at")
-    operator: Literal["==", "!=", ">", "<", ">=", "<="]
+    operator: Literal[">", "<", ">=", "<="]
     value: str = Field(description='ISO 8601 datetime with timezone, e.g. "2026-03-15T00:00:00+00:00".')

Same change applies to the benchmark copy in eval_query_decomposition.py (lines 108-111). If ==/!= support is ever added to production (e.g., for "on YEAR-MM-DD"), update both production and benchmark together — and note that the v1 prompt's judge rules already mention ==/!= in the JSON grammar hint, so the prompt template should be kept in sync as well.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
class TemporalPredicate(BaseModel):
    field: Literal["created_at"] = Field(default="created_at")
    operator: Literal["==", "!=", ">", "<", ">=", "<="]
    value: str = Field(description='ISO 8601 datetime with timezone, e.g. "2026-03-15T00:00:00+00:00".')


class Query(BaseModel):
    query: str = Field(description="A semantically enriched, descriptive query for vector similarity search.")
    temporal_filters: list[TemporalPredicate] | None = Field(
        default=None,
        description="Date predicates on created_at, AND-combined. Null when no creation-date restriction.",
    )


class TemporalPredicate(BaseModel):
    field: Literal["created_at"] = Field(default="created_at")
    operator: Literal[">", "<", ">=", "<="]
    value: str = Field(description='ISO 8601 datetime with timezone, e.g. "2026-03-15T00:00:00+00:00".')
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/prompt_eval/eval_temporal_filter_generation.py` around lines 128 -
139, The benchmark's TemporalPredicate model allows "==" and "!=" but
production's TemporalPredicate (openrag/components/pipeline.py) only permits
">", "<", ">=", "<="; update the operator Literal in TemporalPredicate in this
file (class TemporalPredicate, field operator) to match production by removing
"==" and "!=" so it becomes Literal[">", "<", ">=", "<="], and make the
identical change to the copy in eval_query_decomposition.py so both benchmarks
mirror production validation and scoring.
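The drift is easy to see with the two `Literal` types side by side — a small illustrative check whose names mirror the review, not the codebase:

```python
from typing import Literal, get_args

ProductionOperator = Literal[">", "<", ">=", "<="]
BenchmarkOperator = Literal["==", "!=", ">", "<", ">=", "<="]

# operators the benchmark schema accepts but production validation would reject
drift = [op for op in get_args(BenchmarkOperator) if op not in get_args(ProductionOperator)]
```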

Comment on lines +238 to +264
def _mock_value(schema: dict, root: dict, user_text: str):
    """Recursively build a mock value that satisfies the given JSON Schema node."""
    schema = _resolve_ref(schema, root)
    # anyOf / oneOf — nullable fields (union with null) return None to avoid invalid mock values
    for combiner in ("anyOf", "oneOf"):
        if combiner in schema:
            branches = schema[combiner]
            has_null = any(s.get("type") == "null" for s in branches)
            if has_null:
                return None
            non_null = [s for s in branches if s.get("type") != "null"]
            if non_null:
                return _mock_value(non_null[0], root, user_text)
            return None
    t = schema.get("type")
    if t == "object" or "properties" in schema:
        return {k: _mock_value(v, root, user_text) for k, v in schema.get("properties", {}).items()}
    if t == "array":
        items = _resolve_ref(schema.get("items", {}), root)
        return [_mock_value(items, root, user_text)]
    if t == "string":
        return user_text
    if t in ("integer", "number"):
        return 0
    if t == "boolean":
        return False
    return None


⚠️ Potential issue | 🟠 Major

Mock does not honor enum — will break structured-output tests for TemporalPredicate.op.

_mock_value has no branch for enum, so a constrained string like TemporalPredicate.op (e.g., >=, <=, ==) will be mocked as the user query text and fail Pydantic validation on the caller side (which uses with_structured_output(..., strict=True)). This would likely invalidate the new temporal-filter path in tests.

Also minor on the same helpers:

  • Line 231: lstrip("#/") strips any leading #// as a char set; use removeprefix("#/") for correct prefix semantics on JSON Pointer fragments.
  • _resolve_ref/_mock_value have no cycle guard; a recursive $ref schema would recurse infinitely.
🛠️ Proposed fix: honor enum and const
 def _mock_value(schema: dict, root: dict, user_text: str):
     """Recursively build a mock value that satisfies the given JSON Schema node."""
     schema = _resolve_ref(schema, root)
+    # const / enum — pick a concrete allowed value so strict validation passes
+    if "const" in schema:
+        return schema["const"]
+    if "enum" in schema and schema["enum"]:
+        non_null = [v for v in schema["enum"] if v is not None]
+        return non_null[0] if non_null else None
     # anyOf / oneOf — nullable fields (union with null) return None to avoid invalid mock values
     for combiner in ("anyOf", "oneOf"):
-    parts = schema["$ref"].lstrip("#/").split("/")
+    parts = schema["$ref"].removeprefix("#/").split("/")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/api_tests/api_run/mock_vllm.py` around lines 238 - 264, Handle
enum/const and prevent infinite recursion: update _mock_value to check for
"const" and "enum" (return schema["const"] if present; for "enum" return the
first enum value that matches the expected type) before the type branches so
constrained strings (e.g., TemporalPredicate.op) are mocked correctly; in
_resolve_ref replace lstrip("#/") with removeprefix("#/") to correctly strip
JSON Pointer prefixes; add a simple cycle guard by threading a visited set
(e.g., visited_ids) through _resolve_ref and _mock_value to detect and break
recursive $ref loops (return None or a safe default when a cycle is seen).
Ensure you reference the existing function names _mock_value and _resolve_ref
when making changes.


Labels

feat Add a new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants